gg

gg stands for grammar of graphics, based on Leland Wilkinson’s idea that a plot should be structured according to certain rules, much as a sentence is structured according to grammar.

Every ggplot2 plot has three key components:

  • data

  • aesthetic mappings

  • layers

We use the trees data set to explain the three components of the grammar of graphics:

g <-  ggplot(data = trees, aes(x = Girth, y = Volume)) + geom_point() ## define plot g
g  ## plot the object (i.e. plot) g 

data

You need to specify the data set (i.e. data frame, table, tibble, etc.) that holds the information you want to present in the plot. In the above example, we use the data set trees, which ships with base R.

aesthetic mappings

Each plot needs at least one set of aesthetic mappings between variables in the data and visual properties. The name “aesthetic mapping” may sound confusing. It simply means that you tell ggplot which variables from the chosen data set will be used, and how. Typically, you use the function aes() within ggplot() to specify the role of each variable used in the plot, as illustrated in the above example. Use the parameter

  • x for variable on \(x\)-axis (in the above example, it’s variable Girth )

  • y for variable on \(y\)-axis (in the above example, it’s variable Volume)

  • color (or colour) for variable represented by color of points/markers (if categorical, you will have different colors; if numerical, you will have different shades)

  • shape for (categorical) variable represented by shape of markers (different categories - different marker shapes)

  • size for variable represented by size of markers (larger value - larger marker size)
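
The difference between mapping a variable inside aes() and setting a fixed value outside aes() is easy to miss. Here is a minimal sketch, assuming the trees data set used above; the object names p_mapped and p_fixed are made up for illustration:

```r
library(ggplot2)

## color is MAPPED to a variable: it goes inside aes(), and a legend is built
p_mapped <- ggplot(trees, aes(x = Girth, y = Volume, color = Height)) +
  geom_point()

## color is SET to a constant: it goes outside aes(), and no legend is drawn
p_fixed <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```

Since Height is numerical, the first plot should show shades of one color; in the second plot every marker has the same color.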

layers

At least one layer is required, which describes how to render each observation. Layers are often created with a function whose name starts with geom_. For example:

  • geom_point() - for scatter plots

  • geom_line() - for curves (connected dots)

  • geom_smooth() - for fitted lines/curves.

  • geom_histogram() - for frequencies of numerical variables

  • geom_bar() - for frequencies of categorical variables
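
Layers are added to a plot one after another with +. The following is only a sketch using the trees data set; the names p_layers and p_hist are illustrative, and the choice of 10 bins is arbitrary:

```r
library(ggplot2)

## two layers on the same data: raw points plus a fitted straight line
p_layers <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

## a single layer: distribution of one numerical variable
p_hist <- ggplot(trees, aes(x = Height)) +
  geom_histogram(bins = 10)

p_layers
p_hist
```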

Examples

library(ggplot2) ## don't forget to load ggplot2 package
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

g <-  ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() ## define plot g
g   ## print the plot g

g + geom_smooth(method="lm") + coord_cartesian(ylim=c(0,80))  ## add lin. reg. line to g

g + geom_smooth(method="lm", level=0.99) + coord_cartesian(ylim=c(0,80))  ## same line, with 99% conf. interval

g + geom_smooth(method="lm", se=FALSE) + coord_cartesian(ylim=c(0,80))  ## don't include conf. interval

The default gray background can be removed by adding "+ theme_bw()".

g + geom_smooth(method="lm", se=FALSE) + theme_bw()

Another example: using the diamonds data set from the ggplot2 package.

head(diamonds)  ## dataset from ggplot2, NOT diamond from UsingR
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=clarity))

Representing More than 2 Variables in a Plot

head(airquality,3)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
library(dplyr) ## recode_factor() comes from the dplyr package
month = recode_factor(airquality$Month, '5'="May", '6'="June", '7'="July",
                       '8'="August", '9'="September")
air = airquality[, c("Ozone","Solar.R","Wind")]
air$Month = month
air = air[!is.na(air$Ozone) & !is.na(air$Wind),]
head(air)
  Ozone Solar.R Wind Month
1    41     190  7.4   May
2    36     118  8.0   May
3    12     149 12.6   May
4    18     313 11.5   May
6    28      NA 14.9   May
7    23     299  8.6   May

ggplot(air, aes(x = Ozone, y = Wind)) + geom_point()

Representing 3 Variables in a Plot

Apart from Wind and Ozone represented on \(x\) and \(y\) axes, we can represent the categorical variable Month by color.

ggplot(data=air, aes(x=Wind,y=Ozone,color=Month)) +  
    geom_smooth(method="lm", se=FALSE, color="blue") + 
    geom_point(alpha=0.5) + coord_cartesian(ylim=c(0,175))

g <- ggplot(air, aes(x = Wind, y = Ozone, color=Month)) + 
          geom_point(alpha=1/2) +
          stat_smooth(method="lm", se=FALSE, fill=NA,
                      formula=y ~ poly(x, 2),color="blue") + 
          stat_smooth(method="loess", se=FALSE, color="red")
g

Representing 4 Variables in a Plot

 ggplot(air, aes(x=Wind, y=Ozone, color=Month, size=Solar.R)) + geom_point(alpha=1/2)  +
        geom_smooth(method="lm", se=FALSE, formula=y~poly(x,2), color="blue")
Warning: Removed 5 rows containing missing values (geom_point).

Example with 5 Variables in 2D Plot

Next, let’s look at the mpg data set from the ggplot2 package, which contains data on car fuel economy (miles per gallon).

head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

ggplot(mpg, aes(displ, hwy)) + geom_point()

Apart from displacement and highway mileage being represented on \(x\) and \(y\) axes, we can represent the categorical variable class by color.

 ggplot(mpg, aes(x = displ, y = hwy, color=class)) +  geom_point(alpha=1/2) ## + geom_smooth(method="lm")

g <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) + 
          geom_point(alpha=1/2) +
          stat_smooth(method="lm", se=FALSE, fill=NA,
                      formula=y ~ poly(x, 3),color="blue") + 
          stat_smooth(method="loess", se=FALSE, color="red")
g

“Plotlyfication” of ggplot2 graph

The plotly package has a function ggplotly() that converts a ggplot2 graph into an interactive plotly graph.

library(plotly)
ggplotly(g) %>%    ## g is the graph from the previous slide
   layout(margin=list(l=200, t=60))

Apart from color representing the class variable, we can also use shape to represent the drv variable (4-wheel, front-wheel or rear-wheel drive).

ggplot(mpg, aes(displ, hwy)) + 
   geom_point(aes(color = class, shape = drv)) 

Lastly, we can use the size of the markers to represent yet another variable, say, cty (city mileage). This way we use a 2D plot to represent 5 variables. We also change the legend title of the drv variable to “drive train” and that of the cty variable to “city mileage”.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(color = class, shape = drv, size=cty)) +
  scale_shape_discrete("drive train") +     ## name in the legend for drv variable
  scale_size_continuous("city mileage")     ## name in the legend for cty variable
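
As an alternative sketch (not the approach used above), the same legend titles can be set in one call with labs(), naming each aesthetic; the object name p_labs is illustrative:

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class, shape = drv, size = cty)) +
  labs(shape = "drive train", size = "city mileage")  ## legend titles by aesthetic
p_labs
```

Using labs() keeps all titles in one place, while the scale_*() functions are the better fit when you also need to change breaks, limits or other scale settings.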

Bar Plot

Here is a simple example of a bar plot, using the mpg data set.

head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

# For vertical barplot, don't map a variable to y
ggplot(data=mpg, aes(x=class)) +
  geom_bar(stat="count", width=0.7, fill="steelblue") +
  theme_minimal()
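
With stat="count", the bar heights are computed by ggplot2 from the data. A small check, sketched here with the illustrative names p_bar, bar_heights and class_counts, compares them with a frequency table computed by hand:

```r
library(ggplot2)

p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

bar_heights  <- layer_data(p_bar)$count   ## heights computed by stat = "count"
class_counts <- table(mpg$class)          ## the same frequencies, computed by hand

class_counts
all(bar_heights == as.vector(class_counts))
```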

The following bar plots are variations on the theme from https://www.learnbyexample.org/r-bar-plot-ggplot2/

survey <- data.frame(fruit=c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),
                     people=c(40, 50, 30, 15, 35, 20))
survey
   fruit people
1  Apple     40
2 Banana     50
3 Grapes     30
4   Kiwi     15
5 Orange     35
6  Pears     20

## a fixed fill color is set outside aes(); inside aes() it would be treated as a variable
ggplot(survey, aes(x=fruit, y=people)) + 
      geom_bar(stat="identity", fill="red")

g <- ggplot(survey, aes(x=fruit, y=people, fill=fruit)) + 
         geom_bar(stat="identity") +  
         theme(axis.text.x = element_text(face = "bold", color = "#993333", 
                                    size = 12, angle = 60, hjust=0.8))
g
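
When the heights are already in the data, as in the survey table, geom_col() is a shorthand for geom_bar(stat="identity"). A minimal sketch, with the illustrative names p_identity and p_col:

```r
library(ggplot2)

survey <- data.frame(fruit = c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),
                     people = c(40, 50, 30, 15, 35, 20))

## both plots draw one bar per row, with height taken from the people column
p_identity <- ggplot(survey, aes(x = fruit, y = people)) + geom_bar(stat = "identity")
p_col      <- ggplot(survey, aes(x = fruit, y = people)) + geom_col()

p_col
```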

“Plotlyfication” of the fruit bar plot

library(plotly)
ggplotly(g) %>%
    layout(margin=list(l=150, t=60)) %>%
        config(displaylogo = FALSE)

Two Plots in One Figure

In the following figures we plot the Tesla stock price (ticker/symbol: TSLA) at the close of each trading day, from a year ago until the moment this file was last rendered, which is October 6, 2020. We also plot the daily volume, i.e. the number of shares that changed hands on each day in the last 365 days, as a filled area. In the first figure, the two are in the same plot, while in the second they are in two separate subplots of the same figure, vertically aligned. The code is flexible: if you change the ticker TSLA to the ticker of some other company and render this file, the Tesla graphs are replaced by the graphs of the corresponding company.

library(ggplot2)
library(ggpubr) ## used for function ggarrange(), for ggplot subplots
library(quantmod) ## used for getting stock data from Yahoo Finance
library(timetk) ## needed for function tk_tbl(), to convert time series to tibble
ticker = "TSLA" ## change this to any other ticker 
present = Sys.time()  ## getting the current time, right now (when this file is rendering)
fromtime = present - 365*24*60*60  ## a year ago

## use getSymbols() from package quantmod to get time series (`xts` object) w/ stock data 
tick_df = getSymbols(ticker, from=fromtime, to = present, 
           src = "yahoo", auto.assign=FALSE)
'getSymbols' currently uses auto.assign=TRUE by default, but will
use auto.assign=FALSE in 0.5-0. You will still be able to use
'loadSymbols' to automatically load data. getOption("getSymbols.env")
and getOption("getSymbols.auto.assign") will still be checked for
alternate defaults.

This message is shown once per session and may be disabled by setting 
options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
Warning: 'indexClass<-' is deprecated.
Use 'tclass<-' instead.
See help("Deprecated") and help("xts-deprecated").

## create my time series with TSLA stock price time series
myts = tk_tbl(data=tick_df, rename_index = "date",
       preserve_index = T)  

## let's see the structure of the object myts (my time series)
str(myts)
Classes 'tbl_df', 'tbl' and 'data.frame':   251 obs. of  7 variables:
 $ date         : Date, format: "2019-08-19" "2019-08-20" ...
 $ TSLA.Open    : num  224 228 222 223 220 ...
 $ TSLA.High    : num  228 229 223 225 221 ...
 $ TSLA.Low     : num  222 225 218 218 211 ...
 $ TSLA.Close   : num  227 226 221 222 211 ...
 $ TSLA.Volume  : num  5309600 4125200 7794300 6559000 8538600 ...
 $ TSLA.Adjusted: num  227 226 221 222 211 ...
pricedf = getQuote(ticker)  ## getQuote() is from quantmod package 
## let's see what pricedf is 
str(pricedf)
'data.frame':   1 obs. of  8 variables:
 $ Trade Time: POSIXct, format: "2020-08-14 16:00:01"
 $ Last      : num 1651
 $ Change    : num 29.7
 $ % Change  : num 1.83
 $ Open      : num 1665
 $ High      : num 1669
 $ Low       : num 1627
 $ Volume    : int 12373787

## get the current (i.e. the last) price; 
## if the stock market is closed, it gives the price at closing
lasttime = format(pricedf$"Trade Time", tz="America/Phoenix",usetz=TRUE)

## text to be plotted on one of the graphs, with current stock price
mytext = paste(ticker," price: $",pricedf$Last, "\n", lasttime, sep="")

## using dplyr::pull() we create vectors of prices at closing
## as well as volumes (number of shares sold/bought, i.e. that changed owner)
closeprice = dplyr::pull(.data=myts, paste(ticker,".Close",sep=""))
volumes = dplyr::pull(.data=myts, paste(ticker,".Volume",sep=""))

if (closeprice[length(closeprice)]>closeprice[1]) {
   mycolor = "seagreen"
} else {mycolor = "red2"}

g = ggplot(myts, aes(x=date, y=closeprice))  ## line color is set in geom_line() below

k = ceiling(log10(max(volumes)/max(closeprice))) ##scaling factor; makes plot nicer

g <- g + geom_area(aes(y=volumes/10^k), alpha=0.7) + theme_bw() + 
          geom_line(color=mycolor) + 
          xlab("Time") + 
          ylab(paste("Close Price (in $)\nVolume of shares (in 10^",as.character(k),")",sep=""))
ggplotly(g)
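
The scaling factor k used above brings the volumes down to the same order of magnitude as the prices, so that both fit on one y-axis. Here is a worked example of the arithmetic; the two input numbers are hypothetical and chosen only for illustration:

```r
## hypothetical numbers, not taken from the TSLA data
max_volume <- 3e7    ## largest daily volume (number of shares)
max_price  <- 1650   ## largest closing price (in $)

ratio <- max_volume / max_price   ## about 18182
k <- ceiling(log10(ratio))        ## log10(ratio) is about 4.26, so k = 5

max_volume / 10^k                 ## 300: the rescaled volume stays below the price
```

Rounding up with ceiling() guarantees that the largest rescaled volume does not exceed the largest price.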

g1 = g + geom_area(color="blue", fill=mycolor, alpha=0.5) + 
  ylab("Close Price") + theme_bw() + 
  geom_text(x=myts$date[10], y=max(closeprice),
             label=mytext, 
             size=3, lineheight=1,
             hjust="left", vjust="top")

g2 = ggplot(myts, aes(x=date, y=volumes)) + 
  geom_area(aes(y=volumes), color="blue", fill="royalblue3") + theme_bw() + 
  xlab("Time") + ylab("Volume")

## use function ggarrange() from package ggpubr to make 
## two ggplots below each other
ggpubr::ggarrange(g1,g2, nrow=2, align="v")

Logarithmic scale

Dataset msleep from ggplot2 package

library(ggplot2)
head(msleep)
# A tibble: 6 x 11
  name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
2 Owl ~ Aotus omni  Prim~ <NA>                17         1.8      NA       7  
3 Moun~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
4 Grea~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
5 Cow   Bos   herbi Arti~ domesticated         4         0.7       0.667  20  
6 Thre~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
# ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

g <- ggplot(msleep, aes(brainwt, bodywt)) + 
     scale_x_log10() + 
     scale_y_log10()
g

g + geom_point(aes(color = vore)) + 
    scale_color_manual(
      values = c("red", "orange", "green", "blue"), 
      na.value = "grey50"
    )
Warning: Removed 27 rows containing missing values (geom_point).

clrs = c(carni = "red", insecti = "orange", 
         herbi = "green", omni = "blue")

g + geom_point(aes(color = vore)) + 
    scale_color_manual(values = clrs)
Warning: Removed 32 rows containing missing values (geom_point).

Faceting

p <- ggplot(mpg, aes(cty, hwy)) + 
  geom_jitter(width = 0.1, height = 0.1) 
p + facet_wrap(~cyl)

mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater")
mpg2
# A tibble: 205 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
 2 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
 3 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
 4 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
 5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
 6 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
 7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
 8 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
 9 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
10 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p     comp~
# ... with 195 more rows

Grouping vs. faceting

Faceting is an alternative to using aesthetics (like color, shape or size) to differentiate groups. Both techniques have strengths and weaknesses, based around the relative positions of the subsets. With faceting, each group sits in its own panel, far apart from the others, so the groups cannot overlap. This is good if the groups overlap a lot, but it does make small differences harder to see. When using aesthetics to differentiate groups, the groups are close together and may overlap, but small differences are easier to see.

df <- data.frame(
  x = rnorm(180, c(0, 2, 4)),
  y = rnorm(180, c(1, 2, 1)),
  z = letters[1:3]  ##from english alphabet (`base` package)
)
head(df,8)
            x          y z
1  0.25803542  0.7308333 a
2  1.72924639  3.2895742 b
3  5.47210456 -0.1804129 c
4  0.48750063  1.6734242 a
5  2.50323855  1.8514075 b
6  2.98597841  1.2579372 c
7 -0.03449611  1.4319008 a
8  0.35885873  2.5653356 b

ggplot(df, aes(x, y)) + 
  geom_point(aes(color = z), size=3, alpha=0.5)

ggplot(df, aes(x, y)) + 
  geom_point() + 
  facet_wrap(~z)

Group comparison by showing all the group means in each panel:

library(dplyr) ## for %>%, group_by(), summarize() and rename()
df_sum <- df %>% 
       group_by(z) %>% 
       summarize(x = mean(x), y = mean(y)) %>%
       rename(z2 = z)
ggplot(df, aes(x, y)) + geom_point() + 
              geom_point(data = df_sum, aes(color = z2), size = 4) + 
              facet_wrap(~z)

Group comparison by showing all the data in the background of each panel:

df2 <- dplyr::select(df, -z)

ggplot(df, aes(x, y)) + 
  geom_point(data = df2, color = "grey70", size=3, alpha=0.4) +
  geom_point(aes(color = z), size=3) + 
  facet_wrap(~z)

Maps

Here are a few maps drawn with ggplot2 and related packages.

## data set USArrests from base R
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

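The columns Murder, Assault and Rape give the number of arrests per 100,000 residents; UrbanPop is the percentage of the population living in urban areas. For example, for Arizona, Murder=8.1 means 8.1 arrests for murder per 100,000 people.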

df <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- ggplot2::map_data("state")
m <- ggplot(data=df, aes(fill = murder)) +
        geom_map(aes(map_id = state), map = map) +
        expand_limits(x = map$long, y = map$lat)
ggplotly(m) %>%   ## from plotly package
     layout(xaxis=list(title=""), yaxis=list(title=""))  ## from plotly package
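
geom_map() matches the map_id values in the data against the region names in the map, which is why the state names were converted to lower case above. The following sketch checks which states in the data have no matching outline; the names states_in_data and states_in_map are illustrative, and map_data() assumes the maps package is installed:

```r
library(ggplot2)   ## map_data() also needs the maps package to be installed

states_in_data <- tolower(rownames(USArrests))
states_in_map  <- unique(map_data("state")$region)

## states present in the data but absent from the map outlines
setdiff(states_in_data, states_in_map)
```

The "state" outlines cover the contiguous United States, so Alaska and Hawaii are in the data but are not drawn on this map.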

Arizona Map

az_counties <- map_data("county", "arizona") %>% 
   select(lon = long, lat, group, id = subregion)
head(az_counties, 10)
         lon      lat group     id
1  -109.0453 35.99894     1 apache
2  -109.0511 34.95043     1 apache
3  -109.0511 34.95043     1 apache
4  -109.0511 34.57227     1 apache
5  -109.0568 33.77586     1 apache
6  -109.1656 33.77586     1 apache
7  -109.2917 33.77586     1 apache
8  -109.3261 33.78159     1 apache
9  -109.3490 33.77013     1 apache
10 -109.3605 33.75294     1 apache

ggplot(az_counties, aes(lon, lat, group = group)) +
   geom_polygon(fill = "white", color = "blue") + 
   coord_quickmap() + theme_classic() + 
   theme(axis.ticks=element_blank(),  ## don't show ticks
         axis.text=element_blank(),   ## don't show tick labels/text
         axis.line=element_blank(),   ## don't show line
         axis.title=element_blank())  ## don't show axis name/title

Here we combine plot_usmap() from the usmap package with scale_fill_continuous() and theme() from the ggplot2 package:

library(usmap)
plot_usmap(data = statepop, values = "pop_2015", color = "white") + 
      scale_fill_continuous(name = "Population (2015)", labels = scales::comma) + 
      theme(legend.position = "right")

m <-  plot_usmap(data = statepop, values = "pop_2015", color = "white") + 
         scale_fill_gradientn(name = "Population (2015)", colors = rev(rainbow(7))) +
         theme(legend.position = "right")
ggplotly(m)

usmap::plot_usmap(regions="counties", include = "Arizona")